ABSTRACT



An automated method of creating or updating a database of 
resumes and related documents, the method comprising, 

a) entering at least one example document that is relevant 
to a subject taxonomy in a retrieval priority list, if there 
is a plurality of example documents stored in the 
retrieval priority list, ranking the example documents 
according to the relevancy of the example documents to 
the subject taxonomy; 

b) retrieving a document from a network of documents, 
where the document is the most relevant document to 
the subject taxonomy stored in the retrieval priority list; 

c) harvesting information from specified fields of the
document;

d) classifying the information into one or more classes 
according to specified categories of the subject tax- 
onomy; 

e) storing the information into a database; 

f) determining whether the information includes links to
other documents;

g) ranking the links according to relevancy to the subject
taxonomy, and storing the links in the retrieval priority
list according to the relevancy;

h) terminating the method, provided the method's stop 
criteria have been met; and 

i) repeating steps b) through h), provided the method's
stop criteria have not been met.



10 Claims, 15 Drawing Sheets 
DEVICES AND METHODS FOR
GENERATING AND MANAGING A 
DATABASE 



FIELD OF THE INVENTION 

This invention relates generally to computer network data 
operations and, more particularly, to an apparatus for gen- 
erating and updating databases for the retrieval of informa- 
tion. 

BACKGROUND OF THE INVENTION 

The Internet is a vast collection of documents that is 
accessible to the greatest number of users in the world. The 
Internet is constantly in flux, as new documents are added, 
and older documents are removed. The documents are 
typically written in hypertext mark-up language (HTML) 
and can include a mixture of text, graphic, audio and video 
elements. These documents comprise what is referred to as 
the "World Wide Web" and are also called web pages. 
Internet users can utilize a wide variety of Internet search 
engines that can be accessed with web browsers to locate 
and retrieve web pages that provide useful information. A 
user provides a search query, usually a string of words on a 
topic of interest, to a search engine, which then applies the 
search query to a database of web pages. Links to matching 
pages are returned to the user, typically ranked according
to a similarity score. Some of the currently popular search
engines include "Alta Vista™", "Lycos™", "Yahoo™", 
"Google™" and "Infoseek™". 

The database searched by each search engine is usually a 
proprietary database, created by the search engine operator. 
Often, the search engine database comprises a reverse- 
lookup table of individual words with links to the web 
documents in which they are found. A web page that 
contains multiple instances of the words in a search query 
has a higher similarity score than a web page that contains 
fewer words from the search query. Likewise, a web page 
that contains all the words from a search query will have a 
higher similarity score than a page that does not contain all 
the words from the search query. Although this type of 
matching will generally lead to valid results, such search 
techniques can locate a fair amount of duplicate and irrel- 
evant documents. 
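
The word-matching behavior described above can be illustrated with a minimal sketch. The function names, the sample pages, and the size of the all-words bonus are assumptions made for illustration only; they are not taken from any particular search engine.

```python
from collections import defaultdict


def build_index(pages):
    """Build a reverse-lookup table: word -> {url: occurrence count}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word][url] = index[word].get(url, 0) + 1
    return index


def similarity_scores(index, query):
    """Rank pages by how often the query words occur in them.

    A page containing every query word receives a fixed bonus
    (an assumed value) so that it outranks partial matches.
    """
    words = query.lower().split()
    scores = defaultdict(int)
    matched = defaultdict(set)
    for word in words:
        for url, count in index.get(word, {}).items():
            scores[url] += count
            matched[url].add(word)
    for url in scores:
        if len(matched[url]) == len(set(words)):
            scores[url] += 100  # hypothetical all-words bonus
    # Highest score first; ties broken alphabetically by URL.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

With hypothetical pages such as a coffee-house page that repeats the word "java", a query for "java" alone ranks that page first, which is the kind of irrelevant result the passage describes.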

Most search engines rely on programs called "crawlers" 
or "spiders" that search the Internet for new documents that 
are made accessible to Internet users by storage at a web 
server computer. The contents of such documents are read 
for their word content, and links to these documents (their 
Internet addresses) are automatically added to the reverse 
look-up database of the search engine. Alternatively, humans 
can review the documents and make a determination of 
categories into which the documents should be indexed. The 
search engine database is then modified to include the 
reviewed documents, so that links are inserted into the 
database according to the categories decided upon. In this 
way, the respective search engines include virtually all of the 
documents that may be found on the web. 

Users can then access the search engine and provide a 
query. The search engine applies the query against the 
database and returns matches to the user. Unfortunately, the 
search results can easily become over-inclusive and return 
irrelevant links. For example, a search for information on 
North American wildlife may return links to discussions of 
stock market "bulls" and "bears". A search for Java™ 
programming developments may return links to coffee 
houses. This type of over-inclusion requires reviewing the
search results and discarding the links that are identified as
irrelevant, which can be a very inefficient use of time. As the
number of links to the web increases, an over-inclusive
search can result in inadvertent obfuscation rather than
elucidation of the sought after relevant information.

One way to increase the relevancy of Internet documents
located by a search engine is to limit the breadth of the
search that is conducted. For example, a search may be
limited to web pages found at a particular web site or
Internet domain name. This technique works well if one is
searching only for a web page at a particular site. The
technique is not particularly useful if a more generalized
subject matter search is desired, as the search will then be
under-inclusive and many relevant documents will be
missed.

Aside from being an ever growing repository for
information, the Internet environment, and the World Wide
Web in particular, has become a nexus for commercial
activity. A key factor for commercial success in the Internet
environment is the ability of a web site to attract the web
surfer. Recent trends and activity have seen development of
a business strategy based on Vertical Portals. A Vertical
Portal or "vortal" is a web site that is focused on a specific
topic or several topics. The commercial advantage of such a
site is that it provides the web advertiser with a narrow and
well defined audience to which it can present its products
and/or services. The commercial success of vortals, such as,
ZD Net™ and eTrade™, has demonstrated the viability of
this strategy. One of the features that attract the defined
audience to continually return to a vortal is often the
accessibility of a database that is focused on a specific area
of interest. Vortals are increasingly receiving more traffic
and repeat traffic, demonstrating that users are indeed in
search of better, more relevant information. A further
indication of the success of vortals is their ability to attract
and charge higher advertising rates, due to their well-defined
audience. New vertical portals are projected to launch in
vast numbers in the future.

From the discussion above, it should be apparent that
there is a need for a database search technique that will
provide relevant search results without unduly limiting the
scope of the search. In addition, with the increasing number
of vortals and commercial enterprises on the web there is a
continuing need for an efficient method of generating and
managing online databases. The present invention fulfills
these needs and others.

SUMMARY OF THE INVENTION 

An automated method of creating or updating a database
of resumes and related documents, the method comprising,

a) entering at least one example document that is relevant
to a subject taxonomy in a retrieval priority list, if there
is a plurality of example documents stored in the
retrieval priority list, ranking the example documents
according to the relevancy of the example documents to
the subject taxonomy;

b) retrieving a document from a network of documents,
where the document is the most relevant document to
the subject taxonomy stored in the retrieval priority list;

c) harvesting information from specified fields of the
document;

d) classifying the information into one or more classes
according to specified categories of the subject
taxonomy;

e) storing the information into a database;

f) determining whether the information includes links to
other documents;

g) ranking the links according to relevancy to the subject
taxonomy, and storing the links in the retrieval priority
list according to the relevancy;

h) terminating the method, provided the method's stop
criteria have been met; and

i) repeating steps b) through h), provided the method's
stop criteria have not been met.
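
Steps a) through i) can be pictured as a best-first retrieval loop. The following is a minimal sketch under stated assumptions: the class name, the callables `fetch`, `harvest`, `classify` and `relevancy`, and the use of a document count as the only stop criterion are all hypothetical stand-ins, not the implementation described in this specification.

```python
import heapq


class RetrievalPriorityList:
    """Priority list of links, most relevant first (assumed structure)."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, relevancy):
        # Each link is queued once; heapq is a min-heap, so negate.
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-relevancy, url))

    def pop(self):
        _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)


def build_database(examples, fetch, harvest, classify, relevancy,
                   max_documents):
    queue = RetrievalPriorityList()
    for url in examples:                        # step a)
        queue.push(url, relevancy(url))
    database = {}
    # Steps h) and i): stop when the list is empty or the
    # assumed document limit has been reached.
    while queue and len(database) < max_documents:
        url = queue.pop()                       # step b)
        info = harvest(fetch(url))              # step c)
        database[url] = classify(info)          # steps d) and e)
        for link in info.get("links", []):      # step f)
            queue.push(link, relevancy(link))   # step g)
    return database
```

Because every newly found link is pushed with its relevancy before the next document is popped, the loop always retrieves the most relevant link known at that moment.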

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a block diagram of a computer network, such as 
the Internet, over which documents are processed to create 
a database that can be searched to identify relevant docu- 
ments. 

FIG. 2 is a flow diagram that illustrates the operations 
performed in utilizing the system illustrated in FIG. 1. 

FIG. 3 is a block diagram representation of a computer in 
the FIG. 1 system. 

FIG. 4 is a block diagram representation of the organiza- 
tion of the Back-End component illustrated in FIG. 1. 

FIG. 5 is a flow diagram that illustrates the processing 
performed by the Back-End component of FIG. 1. 

FIG. 6 is a flow diagram that illustrates the processing
performed by the Harvester.

FIG. 7A is a flow diagram that illustrates the processing
performed by a Harvester using a Model Builder module.

FIG. 7B is a flow diagram that illustrates the processing 
performed by a Classifier using a Model Builder module. 

FIG. 8 is a flow diagram that illustrates the operations 
performed by the Classifier module of the Back-End com- 
ponent illustrated in FIG. 4. 

FIG. 9 is a block diagram representation of the organiza- 
tion of the Front-End component illustrated in FIG. 1. 

FIG. 10 is a flow diagram that illustrates the processing
performed by the Front-End component for a user accessing
a Database illustrated in FIG. 1.

FIG. 11 is a flow diagram that illustrates the processing
performed by the Front-End component for a user/client
accessing the Back-End component illustrated in FIG. 1.

FIG. 12 is a block diagram that illustrates applications and 
files in the Front End and Back End components that enable 
management of client database files. 

FIG. 13 is an example of a display from the Client 
Interface application, which shows taxonomy, and resource 
information from a client database. 

FIG. 14 is an example of a portion of the Client Interface 
application, which shows information about any specified 
directory and the resources that are classified within the 
directory. 

DETAILED DESCRIPTION 
Terms and Definitions 

As used herein the term, "network of documents" refers 
to a body or collection of documents, such as, the Internet, 
the World Wide Web, local area networks (LANs), intranets, 
and the like. 

As used herein the term, "documents" refers to informa- 
tion that is accessible from a network of documents, such as, 
web pages, web documents, and the like. Those of ordinary 
skill would be familiar with the above types of documents, 
and appreciate the applicability of the present invention to 
other like documents. 

As used herein the terms, "information", "links", or
"resource links" refer to data contained in documents. For
example, data include but are not limited to the following
forms: data may be textual, found in various formats, such as,
ASCII text, HTML ("links"), XML or the like. The data may
also be in the form of a graphics file found in various graphic
file formats, such as, JPG, BMP, TIF or the like; or the data
may also be in the form of a multimedia file, such as, AVI,
MPEG, MOV or the like; or the data may also be in the form
of an audio file, such as, WAV, MP3 or the like. Those of
ordinary skill would be familiar with the above types of data,
and appreciate the applicability of the present invention to
other like forms of data.

As used herein the terms "spider" or "crawler" refer to a
sequence of computer commands in the form of a computer
program, subroutine or the like, that locates and retrieves
documents according to specified criteria from a network of
documents, such as, the Internet, the World Wide Web,
LANs, intranets, or the like.

As used herein the term "harvester" refers to a sequence
of computer commands in the form of a computer program,
subroutine or the like, that extracts information from a
document. The information is extracted from pre-specified
fields in the document.

As used herein the term "Harvester Content Type Model"
refers to a model that directs the Harvester as to the fields in
a type of document to extract. Harvester Content Type
Models are developed by an automated machine learning
routine based on training sets of documents that exemplify
the type of document that is to be harvested. For example,
a Harvester Content Type Model for harvesting information
from resumes could direct the Harvester to locate and extract
information from the fields in the document corresponding
to the name of the individual, the address, educational
background, and commercial background.
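
As a rough illustration of model-directed harvesting, the sketch below uses a hand-written table of regular expressions in place of a learned Harvester Content Type Model. This substitution is an assumption made to keep the example short: the models described here are built by machine learning from training documents, and the field labels and patterns below are hypothetical.

```python
import re

# Hypothetical stand-in for a learned model: field name -> pattern.
RESUME_MODEL = {
    "name": re.compile(r"^Name:\s*(.+)$", re.MULTILINE),
    "address": re.compile(r"^Address:\s*(.+)$", re.MULTILINE),
    "education": re.compile(r"^Education:\s*(.+)$", re.MULTILINE),
    "employment": re.compile(r"^Employment:\s*(.+)$", re.MULTILINE),
}


def harvest(document, model):
    """Extract only the fields the model names; skip absent ones."""
    fields = {}
    for field_name, pattern in model.items():
        match = pattern.search(document)
        if match:
            fields[field_name] = match.group(1).strip()
    return fields
```

A different document type would be handled by passing a different model, leaving the `harvest` routine itself unchanged.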

As used herein the term "classifier" refers to a sequence
of computer instructions in the form of a computer program,
subroutine or the like, that classifies information according
to a specified taxonomy.

As used herein the term "Classifier Content Type Model"
refers to a model that provides the classifier with a model
taxonomy from which extracted information is automatically
assembled into a taxonomy. The extracted information can be
automatically assigned into a database, or alternatively may
be reviewed prior to assignment. For example, a Classifier
Content Type Model for classifying extracted information
from resumes could determine the appropriate category to
store certain information, such as, whether the information
is related to academic background, work experience, or
personal information.
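
The resume example above can be sketched as a simple keyword-overlap classifier. The keyword sets and the fallback category are assumptions chosen for illustration; a Classifier Content Type Model as described here would be derived from a model taxonomy rather than written by hand.

```python
# Hypothetical model: category -> indicative keywords.
CLASSIFIER_MODEL = {
    "academic background": {"degree", "university", "gpa", "thesis"},
    "work experience": {"employer", "managed", "engineer", "salary"},
    "personal information": {"address", "phone", "email", "hobbies"},
}


def classify(text, model, default="unclassified"):
    """Return the category whose keywords overlap the text the most."""
    words = set(text.lower().replace(",", " ").split())
    best, best_hits = default, 0
    for category, keywords in model.items():
        hits = len(words & keywords)
        if hits > best_hits:  # ties keep the earlier category
            best, best_hits = category, hits
    return best
```

Because the same text always maps to the same category, the result is repeatable across runs, unlike manual review by several people.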

As used herein the term "Directed Graph Cluster Module"
refers to a sequence of computer instructions in the form of
a computer program, subroutine or the like, that determines
relevancy of a link to a specific topic according to the
number of other links related to said specific topic that
refer to it. For example, typically a link that has a greater
number of links linked to it that are also relevant to said
subject topic, is construed to be of high relevance to the
subject topic.
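
The counting rule in this definition can be sketched as follows. The graph representation (a mapping from each page to the pages it links to) and the function name are assumptions for illustration, and the raw count is used as the score without any normalization.

```python
def inbound_relevancy(link_graph, relevant_pages):
    """Count, for each target, the relevant pages that link to it.

    link_graph maps a source URL to the URLs it links to.
    Only sources already judged relevant to the topic are counted.
    """
    scores = {}
    for source, targets in link_graph.items():
        if source not in relevant_pages:
            continue
        for target in set(targets):      # count each source once
            if target != source:         # ignore self-links
                scores[target] = scores.get(target, 0) + 1
    return scores
```

A link referenced by many on-topic pages therefore receives a higher score than one referenced mainly by off-topic pages.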

As used herein the term "subject taxonomy" refers to a
subject area for which information is gathered and
categorized.

As used herein the term "example document" or "example
documents" refers to documents provided as examples of the
type of information that is being sought. Typically, example
documents are used to aid the Harvester in selecting the most
relevant Harvester Content Type Model, and the Classifier in
selecting the most relevant Classifier Content Type Model.

As used herein the term "Retrieval Priority List" refers to a
repository of hypertext links, URL addresses, or the like,
used in retrieving documents from a network of documents,
such as, the Internet, World Wide Web, LANs, intranets, or
the like. In the present invention, the contents of the
Retrieval Priority List are ranked according to their
relevance to a subject taxonomy. After each retrieved document
is harvested and classified, information from the document
that is identified as links is added to the Retrieval Priority
List according to its relevance to the subject taxonomy. In
this way, the Retrieval Priority List is dynamic; it is always
directing the Spider to retrieve the most relevant document
identified by the process at any given moment. For example,
in the instance where the spider is retrieving publicly
available resumes and publications from the web, a retrieved
resume may provide a link to a publication directed to
subject matter that is relevant to the position that is to be
filled. Evaluation of the publication may show that the
potential co-authors are equally or more desirable
candidates, in which case the resumes for these individuals
may also be sought from the Internet.

As used herein the term "stop criteria" refers to any single
or set of conditions, which would signal the termination of
the method of the present invention. Typical stop criteria
include but are not limited to the following conditions: the
method having retrieved, harvested and classified a certain
number of documents, the method having run for a specified
amount of time, the method having retrieved a specified
number of documents at a specified level of relevancy to a
subject taxonomy.

As used herein the term "resume" or "curriculum vitae"
refers to a document that contains, typically, information
relevant to an individual's work and/or educational
experience. Such documents are typically used to outline an
individual's qualifications for a position. Within the context
of this term, the term "and related documents" refers to
documents that provide additional information regarding an
individual's qualifications and expertise in a particular area,
e.g., journal articles, or publications.

The present invention provides an automated method and
device for creating, updating, accessing and managing
databases. An embodiment of the present invention provides an
automated method of creating or updating a database, the
method comprising,

a) entering at least one example document that is relevant
to a subject taxonomy in a retrieval priority list, if there
is a plurality of example documents stored in said
retrieval priority list, ranking said example documents
according to the relevancy of said example documents
to said subject taxonomy;

b) retrieving a document from a network of documents,
where said document is the most relevant document to
said subject taxonomy stored in said retrieval priority
list;

c) harvesting information from specified fields of said
document;

d) classifying said information into one or more classes
according to specified categories of said subject
taxonomy;

e) storing said information into a database;

f) determining whether said information includes links to
other documents;

g) ranking said links according to relevancy to said
subject taxonomy, and storing said links in said
retrieval priority list according to said relevancy;

h) terminating said method, provided said method's stop
criteria have been met; and

i) repeating steps b) through h), provided said method's
stop criteria have not been met.

One particular aspect of this embodiment is where said
method is used to create or update a database of publicly
available resumes retrieved from a network of documents.
For example, said example documents are resumes of
individuals with one or more desired attributes, such as,
technical expertise, years of work experience in an industry,
academic training (type of degree, institution where degree
awarded, grade point average, etc.). Such a database is
particularly advantageous to any individual or entity that
desires to identify individuals who have posted their
resumes on a network of documents with specific attributes,
such as, a particular technical expertise, or employment
experience, as possible employment candidates. In addition,
the present invention also provides a method of entering into
the database resumes that are directly submitted to the user.
Another aspect of the present invention is to include in the
database documents, such as publications, articles and the
like, relating to certain resumes that are relevant to a desired
attribute. This is especially beneficial because it provides the
user with additional information about a candidate prior to
contacting the potential candidate.

Prior to the present invention, creating and updating a
database of resumes was particularly labor intensive, time
consuming and inaccurate, if done at all. Entries to the
database had to be entered manually, or scanned into a
digital format, converted to a textual context format by an
optical character recognition application and then entered
into the database. Since there is no single uniform format for
a resume, each resume must be reviewed by a human who
identifies areas of interest, or likely interest for the user. This
information is then entered into a category in the database.
If more than one person is reviewing the resumes then there
is the possibility that variability in categorization can occur
due to differences in reviewer interpretation.

Recently, there have emerged a number of Internet
business entities ("e-businesses") that provide recruiting
services to companies by posting job listings for a fee, such
as, "Monster.com™" and "Hotjobs.com™". A statistic used
by these e-businesses to entice client companies is the
number of resumes that are available in their databases. A
large number of available resumes translates to a larger pool
of potential candidates available to the e-businesses' clients.
Resumes are typically submitted to these e-businesses
through the web. However, the submission process is not
efficient.

For example, in the case of Monster.com, the resume
submitter is not able to simply supply a copy of his or her
resume; rather, the information from the resume must be
entered into the Monster.com system according to specified
fields. If the resume submitter wanted to submit his or her
resume to another e-business soliciting for resumes, this
process would have to be repeated. Therefore, the prior
method of submitting a resume to an e-business is
cumbersome and presents a barrier to [illegible] resumes for
these e-businesses.

The present invention provides a method of taking
advantage of the increasing number of publicly available
resumes that are posted to the Internet. The present invention
provides a method of creating and updating a database based on
such resumes and related documents. In this manner, the
number of resumes in the database of such an e-business
could be increased dramatically, and kept up to date
automatically. These are features that are of value to the
e-businesses' clients. Another aspect of the present invention
provides a method of reviewing the identified resume to ensure
that the individual posting the resume desires to be contacted.
In some instances, individuals posting their resume indicate
that they do not want to be contacted; in such cases, the
resume would be tagged by the present invention such that
the user does not contact said individual. Another aspect of
the present invention provides for e-mail notification to the
resume poster that their resume and related documents have
been identified from a network of documents, such as, the
Internet, a local area network or an intranet, as being of
potential interest to an employer, and permission is
requested from the resume poster prior to making this
information available to said potential employer.

The present invention provides the user with many
benefits, including but not limited to the following. For
example, the method is automated and performed by
computers. Therefore, the databases can be automatically created
and updated in any desired time frame. Also, because the
process is performed by computers, the resultant database is
more uniform and consistent since the extraction and
classification process is free of variation from human
interpretation. In the present invention, categorization of
extracted information is performed by the classifier according
to a consistent and set regime. This is an important feature
because as long as the categorization process is consistent,
all information that is consistent with the Classifier Content
Type Model is assigned to the same category. Therefore, even
if the information is assigned to a less than optimal category,
the information can still be located.

A database of resumes of potential job candidates created
and updated by the present invention is very useful. Without
such a database, the process of tracking and prioritizing
resumes is reduced to the ability of the user to manually
manage this body of information. Having such a database
allows the user to easily review and prioritize potential
candidates according to various user desired attributes by
submitting and refining search parameters according to the
results. For example, in the instance where the database
returns an insufficient number of candidates for a desired
attribute, the database search can be easily broadened to
increase the number of potential candidates. Conversely, if a
search returns a large number of potential candidates, the
search can be narrowed to decrease the number. In both
cases, the database is a tool for providing the user with an
optimal number of resumes to review. In addition, the
present invention provides a feature where additional
information about the resume submitter may also be collected
and made available to the user for review. The user is thus
able to identify a list of potential candidates and evaluate
aspects of their qualifications through additional publicly
available documents before making a decision whether to
contact said resume submitter. Another aspect of the present
invention provides for the incorporation into the database of
resumes and other relevant documents that are received in
paper form. This is done using known applications for
reducing such paper documents into a digital format, such as
scanning and the like, and further converting such digital
data into a textual content format by utilizing optical
character recognition applications or the like.

Another aspect of the present invention provides an
automated method of managing the recruitment functionality
for a business organization by creating and updating a
database of resumes. Such a database contains resumes
located and retrieved from the Internet, as well as resumes
that have been submitted through traditional methods, e.g.,
by mail, fax or delivered by hand. The present invention is
of use to traditional human resources or recruiting
departments in a company. The present invention is of particular
use for companies or individuals in organizations that lack
the size or infrastructure to support a traditional human
resource functionality. In this instance, the user is typically
the manager of a department or a work group seeking to
identify a qualified candidate for a position without the
support of a human resource functionality. In certain
instances, a company may want to utilize such a database in
lieu of support from a human resource group for identifying
possible candidates, since the present invention would
permit the person who is seeking the new employee to do the
selection. This provides added efficiencies to the process
because the person who is requesting the new employee is
typically best positioned to determine whether the candidate
has the requisite technical skill and personal style needed
for the position being filled. The efficiencies in managing the
recruitment process provided by the present invention
also make the human resource functionality more
efficient. Utilizing a database created and updated by the
present invention results in less time and effort required to
identify and keep track of possible candidates, thereby
allowing human resource departments to address other
human resource areas of responsibility, such as, benefits,
employee morale, and the like.

Another embodiment of the present invention provides an
automated method of creating or updating a database, said
method comprising the steps of,

a) a step for training a spider to retrieve documents
relevant to example documents from a network of
documents;

b) a step for retrieving said relevant documents from said
network of documents;

c) a step for extracting information from said retrieved
relevant documents;

d) a step for classifying said extracted information;

e) a step for storing said extracted information into a
database;

f) a step for determining whether said information includes
links to other documents;

g) a step for ranking said links according to relevancy to
said taxonomy, and storing said links in said retrieval
priority list according to said relevancy;

h) a step for terminating said method, provided that said
method's stop criteria have been met; and

i) repeating steps b) through h), provided said method's
stop criteria have not been met.

One particular aspect of this embodiment is where said
database is a database of resumes.
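
The stop criteria tested in steps h) and i) can be sketched as a small object that checks the three typical conditions named in the definitions: a document count, an elapsed time, and a count of documents at a given relevancy level. The class, its field names and the default threshold are assumptions made for illustration.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class StopCriteria:
    """Any configured condition being satisfied ends the method."""
    max_documents: Optional[int] = None
    max_seconds: Optional[float] = None
    min_relevant_documents: Optional[int] = None
    relevancy_threshold: float = 0.5  # assumed default

    def met(self, processed, started_at, relevancies, now=None):
        now = time.monotonic() if now is None else now
        if self.max_documents is not None and processed >= self.max_documents:
            return True
        if self.max_seconds is not None and now - started_at >= self.max_seconds:
            return True
        if self.min_relevant_documents is not None:
            relevant = sum(
                1 for r in relevancies if r >= self.relevancy_threshold
            )
            if relevant >= self.min_relevant_documents:
                return True
        return False
```

Conditions left as `None` are ignored, so a single criterion or any combination can be configured.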

One particular aspect of the present embodiment is where
the act of harvesting information from specified fields is
according to a Harvester Content Type Model. For example,
a Harvester Content Type Model can be developed to locate
and extract fields of information that are of interest to a
potential employer or recruiter ("user"). Such fields include
but are not limited to the following information about the
potential candidate: name, address, phone number, e-mail
address, career objective, educational history (e.g., allowing
for multiple records with information relating to degree
awarded, subject, grade point average, date of graduation,
honors awarded, school and location of school),
employment experience (e.g., allowing for multiple records with
information relating to duration of work, employer, position,
location of company, salary history, skills used, skills
developed, and accomplishments), salary desired,
skills/qualifications, personal interests, and references.
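
One way to picture the harvested fields, including the repeated education and employment records, is as a nested record type. The sketch below is an assumed layout covering a subset of the fields listed above; the class and attribute names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EducationRecord:
    degree: str
    subject: str
    school: str
    graduation_date: Optional[str] = None
    grade_point_average: Optional[float] = None


@dataclass
class EmploymentRecord:
    employer: str
    position: str
    duration: Optional[str] = None
    skills_used: List[str] = field(default_factory=list)


@dataclass
class ResumeRecord:
    name: str
    email: Optional[str] = None
    phone: Optional[str] = None
    career_objective: Optional[str] = None
    # Lists allow multiple education and employment records.
    education: List[EducationRecord] = field(default_factory=list)
    employment: List[EmploymentRecord] = field(default_factory=list)
    skills: List[str] = field(default_factory=list)
```

Optional fields default to `None` because, as noted elsewhere in this description, resumes follow no single uniform format and any field may be absent.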

Another particular aspect of the present embodiment is
where the act of classifying the information is according to
a Classifier Content Type Model. Yet another aspect of the
present embodiment is where the act of determining the
link's relevancy to the subject taxonomy is determined
according to a Classifier Content Type Model. For example,
typical types of categories of information that may be of
interest include but are not limited to the exemplary fields
of information extracted by the harvester, as previously
taught above. Still another aspect of the present embodiment
is where the act of determining the link's relevancy to the
subject taxonomy is determined according to a Directed
Graph Cluster Module.

Another embodiment of the present invention is a method
of locating a document or set of documents in a database
relevant to a topic, the method comprising,

a) an act of receiving a topic;

b) an act of applying the topic to the subject taxonomy of
the database created from a system that generates the
database by performing a method comprising:

c) entering at least one example document that is relevant
to a subject taxonomy in a retrieval priority list, if there
is a plurality of example documents stored in said
retrieval priority list, ranking said example documents
according to the relevancy of said example documents
to said subject taxonomy;

d) retrieving a document from a network of documents,
where said document is the most relevant document to
said subject taxonomy stored in said retrieval priority
list;

e) harvesting information from specified fields of said
document;

f) classifying said information into one or more classes
according to specified categories of said subject
taxonomy;

g) storing said information into a database;

h) determining whether said information includes links to
other documents;

i) ranking said links according to relevancy to said
subject taxonomy, and storing said links in said
retrieval priority list according to said relevancy;

j) terminating said method, provided said method's stop
criteria have been met; and

k) repeating steps d) through j), provided said method's
stop criteria have not been met.
Another embodiment of the present invention provides a 
computer system for creating or updating a database, the 
computer system comprising,

a) a central processing unit that can establish communi- 
cation with the network; and 

b) program memory that stores programming instructions
that are executed by the central processing unit such
that the computer system executes a method
comprising,

c) entering at least one example document that is relevant
to a subject taxonomy in a retrieval priority list, if there
is a plurality of example documents stored in said
retrieval priority list, ranking said example documents
according to the relevancy of said example documents
to said subject taxonomy;

d) retrieving a document from a network of documents,
where said document is the most relevant document to
said subject taxonomy stored in said retrieval priority
list;

e) harvesting information from specified fields of said
document;

f) classifying said information into one or more classes
according to specified categories of said subject
taxonomy;

g) storing said information into a database;

h) determining whether said information are links to other
documents;

i) ranking said links according to relevancy to said
subject taxonomy, and storing said links in said
retrieval priority list according to said relevancy;

j) terminating said method, provided said method's stop
criteria have been met; and

k) repeating steps d) through j), provided said method's
stop criteria have not been met.
Another embodiment of the present invention provides a 
program product for use in a computer system that executes 
program steps recorded in a computer-readable media to 
perform a method of creating or updating a database, the 
method comprising, 

a) entering at least one example document that is relevant 
to a subject taxonomy in a retrieval priority list, if there 
is a plurality of example documents stored in said 
retrieval priority list, ranking said example documents 
according to the relevancy of said example documents 
to said subject taxonomy; 

b) retrieving a document from a network of documents, 
where said document is the most relevant document to 
said subject taxonomy stored in said retrieval priority 
list; 

c) harvesting information from specified fields of said 
document; 

d) classifying said information into one or more classes 
according to specified categories of said subject tax- 
onomy; 

e) storing said information into a database; 

f) determining whether said information are links to other 
documents; 

g) ranking said links according to relevancy to said
subject taxonomy, and storing said links in said
retrieval priority list according to said relevancy;

h) terminating said method, provided said method's stop
criteria have been met; and

i) repeating steps b) through h), provided said method's
stop criteria have not been met.

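The loop recited in steps a) through i) above can be sketched as a best-first crawl over a priority queue. This is a minimal sketch under stated assumptions, not the disclosed implementation: the `network` dictionary stands in for the network of documents, `relevance` is a hypothetical keyword-overlap score, links are ranked by the text of the document they point to, and the stop criterion is simply a maximum document count.

```python
import heapq


def relevance(text, taxonomy_terms):
    """Hypothetical score: fraction of taxonomy terms that occur in the text."""
    words = set(text.lower().split())
    return sum(1 for term in taxonomy_terms if term in words) / len(taxonomy_terms)


def crawl(network, examples, taxonomy_terms, max_documents):
    # network: dict url -> (text, [linked urls]); an in-memory stand-in.
    priority = []  # retrieval priority list; scores are negated for a max-heap
    for url in examples:
        heapq.heappush(priority, (-relevance(network[url][0], taxonomy_terms), url))
    seen = set(examples)
    database = {}
    while priority and len(database) < max_documents:  # stop criteria
        _, url = heapq.heappop(priority)  # most relevant entry first
        text, links = network[url]
        words = set(text.lower().split())
        # Harvest and classify: record the taxonomy terms the document matches.
        database[url] = sorted(term for term in taxonomy_terms if term in words)
        for link in links:  # rank newly found links and store them by relevancy
            if link in network and link not in seen:
                seen.add(link)
                score = relevance(network[link][0], taxonomy_terms)
                heapq.heappush(priority, (-score, link))
    return database
```

With a three-document network in which only one linked document mentions the taxonomy terms, that document is retrieved before the unrelated one, which is the behavior the retrieval priority list is meant to provide.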
Another embodiment of the present invention provides a 
method of locating a document or set of documents in a 
database relevant to a topic, the method comprising the steps 
of, 

a) a step for receiving a topic; 

b) a step for applying the topic to the subject taxonomy of 
the database created from a system that generates the 
database by performing a method comprising: 

c) a step for training a spider to retrieve relevant docu- 
ments to example documents from a network of docu- 
ments; 

d) a step for retrieving said relevant documents from said 
network of documents; 

e) a step for extracting information from said retrieved 
relevant documents; 

f) a step for classifying said extracted information; 

g) a step for storing said extracted information into a 
database; 

h) a step for determining whether said information are 
links to other documents; 

i) a step for ranking said links according to relevancy to
said taxonomy, and storing said links in said retrieval
priority list according to said relevancy;

j) a step for terminating said method, provided that said
method's stop criteria have been met; and

k) repeating steps d) through j), provided said method's
stop criteria have not been met.
A system constructed in accordance with the invention
creates a database by placing a starting document into a
retrieval priority list. The document is compared with a
subject taxonomy and is then harvested by determining a
category into which the document will be placed, wherein
the category is specified by a taxonomy of subject categories.
The document is next classified into one or more classes
within the taxonomy categories and a database entry is
generated that points from the classes to the document.
Either the single document can be harvested, or all documents
at a common domain or web site may be queued and
harvested in this manner. For each document harvested, the
system further processes each document by determining
links in the document that point to other documents of the
network (even if in other domains) and by adding these
linked documents to the processing queue. The linked documents
in the processing queue are then processed by repeating
the steps of retrieving, harvesting, and classifying.

An embodiment of the present invention is a method of 
creating a database of documents for query searching, the 
method comprising,

retrieving a starting document located at a network
address into a retrieval processing queue;

comparing the document with a subject taxonomy;

harvesting information from specified fields in said
document that is relevant to said subject taxonomy;

classifying the document into one or more classes within
the taxonomy category;

storing the document into an index comprising links from
the classes to the starting document;

determining links in the document that point to other
documents of the network;

adding the linked documents to the data store processing
queue;

repeating the steps of comparing, harvesting, classifying,
and determining for each linked document in the data
store processing queue until a stopping criterion is
reached.

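The queue-driven method above can be sketched as follows. This is an illustrative sketch only: the `network` dictionary replaces real network retrieval, the taxonomy is assumed to be a mapping from class name to keywords, and the stopping criterion is assumed to be a document limit.

```python
from collections import defaultdict, deque


def build_index(network, start_url, taxonomy, max_documents=100):
    # network: dict url -> (text, [linked urls]); hypothetical stand-in.
    # taxonomy: dict class name -> set of keywords; hypothetical representation.
    queue = deque([start_url])  # processing queue seeded with the starting document
    queued = {start_url}
    index = defaultdict(list)   # links from classes to documents
    processed = 0
    while queue and processed < max_documents:  # stopping criterion
        url = queue.popleft()
        text, links = network[url]
        words = set(text.lower().split())       # compare with the taxonomy
        for cls, keywords in taxonomy.items():  # classify into one or more classes
            if words & keywords:
                index[cls].append(url)
        processed += 1
        for link in links:                      # determine and queue linked documents
            if link in network and link not in queued:
                queued.add(link)
                queue.append(link)
    return dict(index)
```

The `queued` set keeps a document from being processed twice when pages link back to each other.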
Another embodiment of the present invention provides a 
method of locating a document in a collection having
relevance to a search query, the method comprising:

receiving the search query;

comparing terms of the search query to a database
created from a system that generates the database by
performing a method comprising:

receiving a starting document located at a network
address into a data store processing queue;

comparing the document with a subject taxonomy;

harvesting information from specified fields in said
document that is relevant to said subject taxonomy;

classifying the document into one or more classes
within the taxonomy category;

storing the document into an index comprising links
from the classes to the starting document;

determining links in the document that point to other
documents of the network;

adding the linked documents to the data store processing
queue;

repeating the steps of comparing, harvesting,
classifying, and determining for each linked document
in the data store processing queue until a
stopping criterion is reached; and

returning links to documents identified by the database
as matching the search query terms.
Another embodiment of the present invention provides a 
computer system for generating a database of a network
document collection for searching, the system comprising: 
a central processing unit that can establish communication 

with the network; and 
program memory that stores programming instructions 
that are executed by the central processing unit such 
that the computer system establishes communication 
with the network and communicates with a network 
user, such that the computer system receives a starting 
document located at a network address into a data store 
processing queue of the computer system, comparing 
the document with a subject taxonomy, harvesting 
information from specified fields in said document that 
is relevant to said subject taxonomy, classifying the 
document into one or more classes within the taxonomy 
category, storing the document into an index compris- 
ing links from the classes to the starting document, 
determining links in the document that point to other 
documents of the network, adding the linked documents
to the data store processing queue, repeating the steps 
of comparing, harvesting, classifying, and determining 
for each linked document in the data store processing 
queue until a stopping criterion is reached. 
Another embodiment of the present invention provides a 
program product for use in a computer system that executes 
program steps recorded in a computer-readable media to 
perform a method for processing a computer file request to 
retrieve a network data file comprising a web site page, the 
program product comprising: 
a recordable media; and 

a program of computer-readable instructions executable 
by the computer system to perform method steps com- 
prising: 

receiving a starting document located at a network 
address into a data store processing queue; 

comparing the document with a subject taxonomy; 

harvesting information from specified fields in said 
document that is relevant to said subject taxonomy; 

classifying the document into one or more classes 
within the taxonomy category; 

storing the document into an index comprising links 
from the classes to the starting document; 

determining links in the document that point to other 
documents of the network; 

adding the linked documents to the data store process- 
ing queue; 

repeating the steps of comparing, harvesting, 
classifying, and determining for each linked docu- 
ment in the data store processing queue until a 
stopping criterion is reached. 
Another embodiment of the present invention provides a 
method of managing a database maintained at a first com- 
ponent from a second component using an internet browser 
application, wherein the database is comprised of references 
developed from documents on the Internet, the method 
comprising, 

a) an act of initiating contact from a second component to 
a first component; 

b) an act of receiving status and content information at the 
second component transmitted from the first compo- 
nent; 

c) an act of transmitting management instructions from
the second component to the first component;

d) an act of receiving updated status and content
information transmitted from the first component;

e) repeating acts b), c) and d), as desired; and 

f) an act of terminating contact from the second component
to the first component at completion of management
tasks.

A particularly advantageous aspect of the present embodi- 
ment is where the contact is through the Internet, a local area 
network, or an intranet. 

Another particularly advantageous aspect of the present
embodiment is where the management instructions are for 
the placement of documents into a taxonomy. 
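
The exchange in acts a) through f) above can be sketched as a simple request and response session. The sketch is illustrative only: `FirstComponent`, its method names, and the in-process calls are assumptions standing in for whatever browser-based transport is actually used, and the only management instruction modeled is placing a document into a taxonomy category.

```python
class FirstComponent:
    """Hypothetical database host that answers management requests."""

    def __init__(self):
        self.taxonomy = {}      # category -> list of document references
        self.connected = False

    def contact(self):
        # Act a): contact is initiated; act b): status and content are returned.
        self.connected = True
        return self.status()

    def status(self):
        return {category: list(docs) for category, docs in self.taxonomy.items()}

    def instruct(self, category, document):
        # Act c): receive an instruction; act d): return updated status.
        if not self.connected:
            raise RuntimeError("no active session")
        self.taxonomy.setdefault(category, []).append(document)
        return self.status()

    def terminate(self):
        # Act f): contact is terminated when management tasks are complete.
        self.connected = False


def manage(first, placements):
    """Second-component side: run one management session."""
    status = first.contact()
    for category, document in placements:  # act e): repeat as desired
        status = first.instruct(category, document)
    first.terminate()
    return status
```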

Another embodiment of the present invention provides a 
computer system for managing a database maintained at a 
first component from a second component using an internet
browser application, wherein the database is comprised of 
references developed from documents on the Internet, the 
system comprising, 

a) an act of initiating contact from a second component to
a first component; 

b) an act of receiving status and content information at the 
second component transmitted from the first compo- 
nent; 

c) an act of transmitting management instructions from
the second component to the first component; 

d) an act of receiving updated status and content infor- 
mation transmitted from the first component; 

e) repeating acts b), c) and d), as desired; and 

f) an act of terminating contact from the second compo- 
nent to the first component at completion of manage- 
ment tasks. 

Another embodiment of the present invention provides a 
program product for use in a computer system that executes
program steps recorded in a computer-readable media to 
perform a method of managing a database maintained at a 
first component from a second component using an internet 
browser application, wherein the database is comprised of 
references developed from documents on the Internet, the
method comprising, 

a) an act of initiating contact from a second component to 
a first component; 

b) an act of receiving status and content information at the 
second component transmitted from the first
component;

c) an act of transmitting management instructions from 
the second component to the first component; 

d) an act of receiving updated status and content
information transmitted from the first component;

e) repeating acts b), c) and d), as desired; and 

f) an act of terminating contact from the second compo- 
nent to the first component at completion of manage- 
ment tasks. 

Yet another embodiment of the present invention provides a
method of providing database management services to a 
database maintained at a first component from a second 
component using an internet browser application, wherein 
the database is comprised of references developed from 
documents on the Internet, the method comprising, 

a) an act of receiving initial contact at a first component 
from a second component; 

b) an act of transmitting status and content information to 
the second component from the first component; 

c) an act of receiving management instructions at the first
component from the second component;

d) an act of transmitting updated status and content
information from the first component to the second
component following completion of the instructions by
the first component;

e) repeating acts b), c) and d), as instructed; and 

f) an act of terminating contact with the second compo- 
nent when receiving such instructions from the second 
component. 

A particularly advantageous aspect of the present embodi- 
ment is where the contact is through the Internet, a local area 
network, or an intranet. 

Another particularly advantageous aspect of the present 
embodiment is where the management instructions are for 
the placement of documents into a taxonomy. 

Another embodiment of the present invention provides a 
computer system for providing database management ser- 
vices to a database maintained at a first component from a 
second component using an internet browser application, 
wherein the database is comprised of references developed 
from documents on the Internet, the method comprising, 

a) an act of receiving initial contact at a first component 
from a second component; 

b) an act of transmitting status and content information to 
the second component from the first component; 

c) an act of receiving management instructions at the first 
component from the second component; 

d) an act of transmitting updated status and content 
information from the first component to the second 
component following completion of the instructions by 
first component; 

e) repeating acts b), c) and d), as instructed; and 

f) an act of terminating contact with the second compo- 
nent when receiving such instructions from the second 
component. 

Another embodiment of the present invention provides a 
program product for use in a computer system that executes 
program steps recorded in a computer-readable media to 
perform a method of providing database management ser- 
vices to a database maintained at a first component from a 
second component using an internet browser application, 
wherein the database is comprised of references developed 
from documents on the Internet, the method comprising, 

a) an act of receiving initial contact at a first component 
from a second component; 

b) an act of transmitting status and content information to 
the second component from the first component; 

c) an act of receiving management instructions at the first 
component from the second component; 

d) an act of transmitting updated status and content 
information from the first component to the second 
component following completion of the instructions by 
first component; 

e) repeating acts b), c) and d), as instructed; and

f) an act of terminating contact with the second compo- 
nent when receiving such instructions from the second 
component. 

Other features and advantages of the present invention 
should be apparent from the following description of the 
preferred embodiment, which illustrates, by way of example,
the principles of the invention.

EXAMPLE 

FIG. 1 is a block diagram representation of a system 100
for retrieving, extracting, and categorizing information, such
as hypertext links, from documents identified from a network
of documents, such as the World Wide Web, the
Internet, a local area network ("LAN"), an intranet, or the
like. The system provides access, or information for accessing,
just those documents located within a network of documents
that are relevant to a given information need. A
Back-End component 102 employs a classification scheme 
as implemented by the Spider, Harvester and Classifier to 
process documents from a network of documents 104 in 
order to place information from the documents into appropriate
nodes in a taxonomy. The classified information
comprises a database 106. If desired, the database 106 can be
stored at either the Back-End 102 or at a Front-End com- 
ponent 108, preferably at the Back-End. The Front-End 
component provides a convenient interface that is accessed 
by a user 110. The user provides an information need to the 
Front-End, which applies the information need against the 
database 106 to identify information, for example, hypertext
links, to documents 104 that are relevant to the
information need. The documents can then be retrieved by 
the user. In this way, documents from a network of docu- 
ments can be efficiently located, harvested, and classified, 
and thus provided for efficient retrieval. 
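
As a rough illustration of how a Front-End might apply an information need against the database 106, the sketch below matches the words of the request against taxonomy node titles and returns the stored links. The data layout and the word-matching rule are assumptions for illustration; the specification does not define the matching procedure at this point.

```python
def find_links(database, information_need):
    # database: dict taxonomy node title -> list of links (hypothetical layout).
    terms = set(information_need.lower().split())
    results = []
    for node, links in database.items():
        # A node matches when any word of its title appears in the request.
        if terms & set(node.lower().split()):
            for link in links:
                if link not in results:  # keep each link once, in stored order
                    results.append(link)
    return results
```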

The system 100 can be implemented in a variety of 
configurations. For example, the Back-End component 102 
may comprise a primary service provider, who maintains the 
database 106 and provides access to the Front-End 108, 
which may comprise a secondary service provider, who 
charges access fees to user 110. Alternatively, the Front-End 
and Back-End may comprise a single point of access to users 
110. In the preferred embodiment, the network of documents
104 comprises all the resources available over the Internet,
including the "World Wide Web", LANs, intranets, or the
like; and the Back-End component 102 and Front-End
component 108 comprise separate computer systems that
communicate with each other. The users 110 comprise
networked computers that communicate with the Front-End
and thereby gain access to the database 106 for searching 
and to the documents 104 for retrieving. Alternatively, all the 
computers can be implemented as a single computer having 
the various components 102, 108, 110, or the components 
can communicate over a local area network (LAN) or 
intranet. 

FIG. 2 is a flow diagram that illustrates the operations
performed in utilizing the system illustrated in FIG. 1. First,
a taxonomy 202 is specified for a topic of interest. For
example, it may be desired to create a database of resources
relating to the "Java™" programming language. The taxonomy
comprises a hierarchy of titles or categories that
specify an outline for a topic. Those skilled in the art will be
familiar with the multiple ways in which a hierarchy may be
represented for computer use, such as linked lists and tables.
If the Front-End and Back-End are separate providers, then
the taxonomy may be provided by either provider, or may be
developed in joint consultation. In either case, the taxonomy
is then used to build a database of resource links by
crawling, harvesting, and classifying, as described further
below.

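One of the table representations mentioned above can be sketched as a parent-pointer table. The node identifiers and category titles below are invented examples, not categories taken from the specification.

```python
# Table form: node id -> (parent id or None, title). Titles are invented examples.
TAXONOMY = {
    1: (None, "Java"),
    2: (1, "Tutorials"),
    3: (1, "Tools"),
    4: (3, "Compilers"),
}


def path(node_id, table=TAXONOMY):
    """Return the outline path from the root of the hierarchy to a node."""
    titles = []
    while node_id is not None:
        parent, title = table[node_id]
        titles.append(title)
        node_id = parent
    return " / ".join(reversed(titles))


def children(node_id, table=TAXONOMY):
    """Return the ids of the categories directly beneath a node."""
    return [nid for nid, (parent, _) in table.items() if parent == node_id]
```

For example, `path(4)` walks the parent pointers upward and yields the outline path for the "Compilers" category.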
The building operation is represented by the flow diagram 
box numbered 204. After the database is completed, the next 
operation is to permit user access to the database for query 
matches that identify resources of interest. This step is 
represented by the flow diagram box numbered 206. In the 
final operating step represented by the flow diagram box 
numbered 208, users retrieve the resources identified by the 
resource links. Typically, the resource links will be the
resource's URL, or hyperlinked text, or some other method
of accessing the document on the World Wide Web that is
known to those of skill in the art. Other operations may then
continue. From time to time, database maintenance may be
performed, for example in order to update the database with
new documents so that these new resources are available for
retrieval.

Computer Configuration 

FIG. 3 is a block diagram of an exemplary computer 300
such as might comprise any of the computers of the Back-End
component 102, the Front-End component 108, or the
users 110. Each computer 300 operates under control of a
central processor unit (CPU) 302, such as a "Pentium®"
microprocessor and associated integrated circuit chips,
available from Intel Corporation of Santa Clara, Calif.,
USA. A computer user can input commands and data from a
keyboard and mouse 304 and can view inputs and computer
output at a display 306. The display is typically a video
monitor or flat panel display device. The computer 300 also
includes a direct access storage device (DASD) 307, such as
a fixed hard disk drive. The memory 308 typically comprises
volatile semiconductor random access memory (RAM).
Each computer preferably includes a program product reader
310 that accepts a program product storage device 312, from
which the program product reader can read data (and to
which it can optionally write data). The program product
reader can comprise, for example, a disk drive, and the
program product storage device can comprise removable
storage media such as a floppy disk, an optical CD-ROM
disc, a CD-R disc, a CD-RW disc, a DVD disc, or the like.
Each computer 300 can communicate with the other connected
computers over the network 313 through a network
interface 314 that enables communication over a connection
316 between the network and the computer.

The CPU 302 operates under control of programming
steps that are temporarily stored in the memory 308 of the
computer 300. When the programming steps are executed,
the pertinent system component performs its functions.
Thus, the programming steps implement the functionality of
the system components 102, 108 illustrated in FIG. 1. The
programming steps can be received from the DASD 307,
through the program product 312, or through the network
connection 316. The storage drive 310 can receive a program
product, read programming steps recorded thereon,
and transfer the programming steps into the memory 308 for
execution by the CPU 302. As noted above, the program
product storage device can comprise any one of multiple
removable media having recorded computer-readable
instructions, including magnetic floppy disks, CD-ROM,
and DVD storage discs. Other suitable program product
storage devices can include magnetic tape and semiconductor
memory chips. In this way, the processing steps necessary
for operation in accordance with the invention can be
embodied on a program product.

Alternatively, the program steps can be received into the
operating memory 308 over the network 313. In the network
method, the computer receives data including program steps
into the memory 308 through the network interface 314 after
network communication has been established over the network
connection 316 by well-known methods that will be
understood by those skilled in the art without further explanation.
The program steps are then executed by the CPU 302
to implement the processing of the system.

It should be understood that all of the computers of the
system 100 illustrated in FIG. 1 preferably have a construction
similar to that shown in FIG. 3, so that details described
with respect to the FIG. 3 computer 300 will be understood
to apply to all computers of the system 100. Any of the
computers can have an alternative construction, so long as
they can communicate with the other computers and support
the functionality described herein.

The Back-End Component 

FIG. 4 is a block diagram representation of the organiza- 
tion of the Back-End component 102 illustrated in FIG. 1. 
FIG. 4 shows that the Back-End of the preferred embodiment
includes a Spider 402, a Harvester 404, a Classifier
406, an SHC Manager 410, a Directed Graph Cluster
Module 412, a Monitor 414, and a Data store facility 416.
With this architecture, the Back-End component 102 sup- 
ports multiple Front-End providers. More particularly, the 
Spider 402, Harvester 404, and Classifier 406 can operate 
independently. This provides easier support and mainte- 
nance for multiple Front-End 108 components, and the 
increased parallelism provides good scalability and accom- 
modation of high peak loads on the system. 
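
The independent operation of the Spider, Harvester, and Classifier can be sketched as three stages connected by queues, each running in its own thread. This is a minimal sketch under stated assumptions: the in-memory queues stand in for the Data store, the three stage functions are supplied by the caller, and a `None` value is used as a hypothetical shutdown signal.

```python
import queue
import threading


def stage(work, inbox, outbox):
    """Run one module: apply `work` to each item until a None sentinel arrives."""
    while True:
        item = inbox.get()
        if item is None:
            outbox.put(None)  # pass the shutdown signal downstream
            return
        outbox.put(work(item))


def run_pipeline(urls, spider, harvester, classifier):
    q1, q2, q3, done = queue.Queue(), queue.Queue(), queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=stage, args=(spider, q1, q2)),
        threading.Thread(target=stage, args=(harvester, q2, q3)),
        threading.Thread(target=stage, args=(classifier, q3, done)),
    ]
    for worker in workers:
        worker.start()
    for url in urls:
        q1.put(url)
    q1.put(None)  # no more jobs
    for worker in workers:
        worker.join()
    results = []
    while True:
        item = done.get()
        if item is None:
            return results
        results.append(item)
```

Because each stage reads from its own queue, an upstream stage can begin its next job while the downstream stage is still busy with the previous one.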

The SHC Manager 410 manages the operation of the 
Spider, Harvester and Classifier, and operates according to a 
cyclical schedule, periodically receiving jobs comprising 
requests for crawling, harvesting, and classifying documents 
from the world wide web for inclusion into a taxonomy. The 
job requests will come from a variety of Front-End providers 
who have arranged with the Back-End to create a database 
specified by their respective topic taxonomy. The SHC 
Manager periodically checks the Data store for job configu- 
ration data to determine currently running jobs, including 
the status of newly received jobs. The SHC Manager will 
select a predetermined number of job requests for process- 
ing. It is the function of the SHC Manager 410 to determine 
the tasks that need to be performed and to apportion tasks 
among the Spider 402, Harvester 404, and Classifier 406.
The SHC Manager may temporarily store results of a job by
a module ("upstream module") in the Data store, while
waiting for the next module ("downstream module") in the process
to complete a pending task. When the "downstream" module
is finished with its task, the next job is forwarded to it from 
Data store. This process allows the modules to operate in 
parallel, thereby increasing system efficiency. Those of skill 
in the art will appreciate that there can be a plurality of
module types, for example, multiple spiders, harvesters, and
classifiers, in the system. The plurality of modules further
enhances the parallel operation of the system and enables it
to process jobs quickly and efficiently. When the SHC
Manager receives a job request, it receives a starting
network address. For example, in the case of the Internet, the
SHC Manager will receive a web site address, also referred
to as the Uniform Resource Locator (URL). The URL is an
Internet address where a web page can be found and
indicates a starting URL for a web site (resource) to be
processed by the system.

The SHC Manager 410 takes each beginning URL and
provides it to the Spider 402. The Spider examines each web
page to determine the links it contains. Those skilled in the
art will be aware that web pages that relate to a particular
topic often contain links, which are pointers to additional
web pages on related topics. It is the function of the Spider
to identify the links that are contained on a web page being
processed, which the Spider receives from the SHC Manager.
The Spider provides the identified links to the SHC
Manager, which schedules further processing. The Harvester
404 receives and extracts information from the contents of
the pages. That is, text of the linked web page is assumed to
be descriptive of the page contents, and is associated with
the link itself. The Classifier 406 receives the descriptive
text and processes it to determine the category in the
taxonomy with which the linked web page is most closely
associated.

The SHC Manager 410 receives the taxonomy category
into which the Classifier 406 has placed a document and
stores the extracted information in the corresponding
taxonomy category of the database being built. As noted above,
the Spider 402 may identify many links from a page being
processed and will provide these to the SHC Manager. The
Classifier provides the category or categories into which a
document should be classified. The SHC Manager adds links
from the document that were extracted and classified to a
retrieval priority list in the Data store 416. At the next
iteration of the SHC Manager, when it next checks job
configuration, those links will be among the links provided
by the SHC Manager to the Spider, Harvester, and Classifier
for processing. The SHC Manager 410 generates statistics
on each job or document processed, such as the number of
links identified, the number of documents processed, the
amount of processing time for the documents, as well as
other statistics indicative of efficiency.

It should be apparent that it is possible for the processing
task to become larger and larger, as links are followed from
the starting document to additional documents, and the links
on those additional documents are, in turn, identified by the
Spider 402 and are followed to more documents.

Those of skill in the art are familiar with principles and
regimes that may be applied in guiding the retrieval, extracting
and classification of documents and information from a
network of documents. Search regimes for problem solving,
and heuristic search methods are discussed in Chapters 3 and
4 of "Artificial Intelligence: A Modern Approach" Prentice
Hall Series In Artificial Intelligence, 1995, Stuart J. Russell,
and Peter Norvig, incorporated herein by reference.

The Directed Graph Cluster Module 412 provides a parallel
process to that of the Classifier. The Classifier assesses the
relevancy of the retrieved document to the search topic
according to the relevancy of the links contained in the
document and the document. The Directed Graph Cluster
Module assesses the relevancy of the document according to
the number of links it has to other documents that are
relevant to the search topic. A document that is relevant to
a given topic will be interconnected and referenced by other
similar documents, and this characteristic can be used to
assess the document's relevancy. A further discussion of this 
process is found in the web based article, "Focused Crawl- 
ing: A New Approach to Topic-Specific Web Resource 
Discovery", Soumen Chakrabarti, Martin van den Berg, and 
Byron Dom, Mar. 29, 1999, 18:29, which was found at the 
website for the Computer Science Department, University of 
California, Berkeley. 

The Monitor 414 can provide a means of checking system
operational status and improving performance. For example, the
Monitor can automatically halt Spider operations after a 
predetermined time limit, or can accept a Front-End user- 
defined halting criterion for stopping Spider or Harvester 
operation. 

FIG. 5 is a flow diagram that illustrates the processing 
performed by the Back-End component 102. In the first 
operation, represented by the flow diagram box numbered 
502, the SHC Manager of the Back-End component receives 
one or more starting links. As noted above, these starting 
links are pulled from the processing queue of the Data store 
416 and comprise either initial URLs submitted by a Front- 
End provider or URLs identified by the Spider 402. Next, the 
Spider receives the next link for processing. This step is 
represented by the flow diagram box numbered 504. The 
Spider then downloads the link by requesting the corre- 
sponding web page, as indicated by the flow diagram box 
numbered 504. The Harvester then processes the retrieved
document in the step represented by the flow diagram box
numbered 506. The Harvester may extract one or more
possible resources from a retrieved document. In the next step,
box 508, the Classifier processes the extracted
resources from the Harvester to determine the appropriate 
taxonomy categories into which the resources should be 
placed. The SHC Manager then stores the web page link 
information into the taxonomy category for the database 
being built, in the processing operation indicated by the flow 
diagram box numbered 510. 

In the next operation, indicated by the box numbered 514, 
the extracted resource links are placed in the processing 
queue of the Data store according to their ranking. Next, at 
the decision box 516, the system checks to determine if a 
halting condition has been reached. If it has not, a negative 
outcome, then processing is continued with the next link at 
box 504. If a halting condition has been reached, an
affirmative outcome at the decision box 516, then link
processing for the current web page is halted, and other system
processing continues. 

Operation of Machine Learning Modules

FIG. 7a is a flow diagram that illustrates an example of a
machine learning module used in the present invention to 
develop content type models. For example, the illustrated 
process is used to develop a content type library that is used 
in the Harvester module to direct the extraction of informa- 
tion from retrieved resources, such as, web documents, web 
pages, and the like. In the first process box 702, a set of
sample documents that exemplify the type of documents
to be harvested is assembled. In process box 704,
the documents are tagged to indicate the types of information
that are to be extracted. For example, documents related
to journal articles might have text fields such as the author's
name, the title of the article, the URL of the document, and
the like tagged. As another example, where the documents are
resumes, text fields such as the name, address,
technical expertise, relevant experience, education
background, work background and the like might be tagged.
Following the tagging, the set of documents is arbitrarily divided
into two sets. In the next step, as illustrated in box 706, a test
model is generated based on one set of documents. The test
model is to be used as a guide for the Harvester in extracting
information. In box 708, the test model is tested against the
second set of documents for accuracy in retrieving the
tagged fields. Since the second set of documents has the
desired fields tagged, the model accuracy can be readily
determined. Box 710 illustrates the evaluation of the
accuracy of the model. If the model is sufficiently accurate, it is
placed into a content type library for future use, as illustrated
in box 714. If the accuracy is not sufficient, the model is
refined 712 and re-tested against the second set of training
materials, as illustrated in 708.
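
The model development loop described above can be sketched as follows. This is only a minimal illustration, not the implementation of the described module: the function names, the accuracy threshold of 0.9, the limit on refinement rounds, and the representation of a model as a callable that maps document text to a dictionary of extracted fields are all assumptions made for the example.

```python
import random


def split_documents(tagged_docs, seed=0):
    """Arbitrarily divide the tagged documents into two sets."""
    docs = list(tagged_docs)
    random.Random(seed).shuffle(docs)
    half = len(docs) // 2
    return docs[:half], docs[half:]


def field_accuracy(model, test_docs):
    """Fraction of tagged fields that the model extracts exactly."""
    correct = total = 0
    for text, tagged_fields in test_docs:
        predicted = model(text)
        for name, expected in tagged_fields.items():
            total += 1
            correct += predicted.get(name) == expected
    return correct / total if total else 0.0


def develop_model(tagged_docs, build_model, refine_model,
                  threshold=0.9, max_rounds=5):
    """Build, test, and refine a model until it is accurate enough.

    build_model and refine_model are hypothetical callbacks supplied by
    the caller; threshold and max_rounds are assumed values.
    """
    first_set, second_set = split_documents(tagged_docs)
    model = build_model(first_set)                    # box 706
    accuracy = 0.0
    for _ in range(max_rounds):
        accuracy = field_accuracy(model, second_set)  # boxes 708 and 710
        if accuracy >= threshold:
            return model, accuracy                    # box 714: keep the model
        model = refine_model(model, first_set)        # box 712
    return None, accuracy
```

A model that passes the check would then be stored in the content type library. Returning `None` when no round reaches the threshold is a design choice of this sketch.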

FIG. 7b is a flow diagram that illustrates an example of a
machine learning module used in the present invention to 
develop content type models. For example, the illustrated 
process is used to develop a content type library that is used 
in the Classifier module to direct classification of harvested 
resources to taxonomy categories. In the first process box 
750, a set of example documents is assembled which
exemplify the types of documents that are to be assigned to
those categories. In process box 752 the module develops a
test model of such a categorization scheme. In process box
754, an additional set of pre-categorized documents is
processed with the test model. In process box 756, the
accuracy of the test model is reviewed. If the accuracy is
sufficient, the model is placed into the Classifier Content
Type Library for later use. If the accuracy is insufficient, the
model is revised and re-tested with the sample set of
documents, as illustrated in process box 758.
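
As a rough illustration of how a categorization model might be built from example documents and checked against pre-categorized documents, the sketch below uses a small Naive Bayes text classifier. Naive Bayes is named elsewhere in this description as one comparison technique for the Classifier; its use for the test model here, along with every function name and the add-one smoothing, is an assumption of this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Build a test model from (text, category) example documents."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocabulary = set()
    for text, category in examples:
        words = text.lower().split()
        word_counts[category].update(words)
        doc_counts[category] += 1
        vocabulary.update(words)
    return {"word_counts": word_counts,
            "doc_counts": doc_counts,
            "vocabulary": vocabulary}


def classify(model, text):
    """Return the category with the highest log-probability for the text."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocabulary"])
    best_category, best_score = None, -math.inf
    for category, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][category]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out a category.
            score += math.log((counts[word] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


def accuracy(model, precategorized_docs):
    """Share of pre-categorized documents assigned to their known category."""
    hits = sum(classify(model, text) == category
               for text, category in precategorized_docs)
    return hits / len(precategorized_docs)
```

The accuracy value would then be compared with whatever sufficiency criterion the operator has chosen before the model is placed in the Classifier Content Type Library.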

Operation of the Crawler

FIG. 11 is a flow diagram that illustrates an example of a 
crawl initiated either to update an existing database, or to 
generate a new database. In the first processing operation, 
represented by the flow diagram box numbered 1102, the 
requestor of the crawl, either the client at the Front-End
component, or the primary service provider at the Back-End,
contacts the Back-End component, which carries out an
authorization process to ensure that the requestor of the function
has authorization to initiate such process, and that financial
charges for the crawl are properly recorded. In the next
processing step, represented by the flow diagram box num- 
bered 1104, the Front-End component transmits to the 
Back-End component a request for a search, the search 
criteria, and a set of training materials exemplifying the 
types of documents desired for the database. Upon receiving 
the request, the Spider processes the resources using the 
Classifier to optimize the search, as represented in flow 
diagram box 1106. The resources are then placed into a 
retrieval priority list according to a ranking given by the 
Classifier. In the next step, as represented by flow diagram 
box 1108, the Spider retrieves a resource from the top of
the retrieval priority list. The retrieved resource is processed by
the Harvester where property information is extracted from 
the resource, as represented by flow diagram box 1110. In 
the next step, as represented by flow diagram box 1112, the 
retrieved resource and the information extracted by the 
Harvester are organized according to taxonomy by the 
Classifier, or alternatively all or a sub-set of the resources 
can be stored into an area for client review prior to entry into 
the database. Referenced resource links are reviewed by the 
Classifier, as represented by flow diagram box 1114, and the 
retrieval priority list is updated accordingly. A check is made 
to determine if the stop criteria have been reached, as
represented by decision box 1116. If the criteria have not been met,
the crawl resumes with the Spider retrieving the topmost
resource from the updated retrieval priority list, as
represented by flow diagram box 1108. If the criteria have been
met, the requestor is notified, and may review the outcome
of the crawl, as represented by flow diagram box 1118. If the
requestor is satisfied with the results of the crawl, the process
is completed, as represented by decision box 1120.
Alternatively, the requestor can request another crawl.
Before beginning another crawl, the client may update
the training materials, for example, with resources retrieved 
from the previous crawl, as represented by flow diagram box 
1122. In addition, the taxonomy may be revised as is deemed 
necessary by the requestor. The second crawl is then initi- 
ated and begins with the processing of the training materials 
as represented by flow diagram box 1106. 
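
The retrieval loop of FIG. 11 can be illustrated with a best-first crawl over a priority queue. The sketch below is a simplification under stated assumptions: `fetch`, `harvest`, and `relevance` are hypothetical callbacks standing in for the Spider, Harvester, and Classifier, and the only stop criterion modeled is a maximum number of stored documents.

```python
import heapq
import itertools


def crawl(seed_urls, fetch, harvest, relevance, max_documents=100):
    """Best-first crawl driven by a retrieval priority list.

    fetch(url) returns a document or None, harvest(document) returns
    (fields, links), and relevance(url) returns a numeric score. All
    three are assumed callbacks supplied by the caller.
    """
    order = itertools.count()  # tie-breaker that keeps heap entries comparable
    priority_list = []
    seen = set()
    for url in seed_urls:
        seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the best first.
        heapq.heappush(priority_list, (-relevance(url), next(order), url))

    database = {}
    while priority_list and len(database) < max_documents:  # box 1116
        _, _, url = heapq.heappop(priority_list)             # box 1108
        document = fetch(url)
        if document is None:
            continue
        fields, links = harvest(document)                    # box 1110
        database[url] = fields                               # box 1112
        for link in links:                                   # box 1114
            if link not in seen:
                seen.add(link)
                heapq.heappush(priority_list,
                               (-relevance(link), next(order), link))
    return database
```

Because every newly discovered link is ranked before it enters the list, the next retrieval is always the most relevant resource known at that point, which is the behavior the description attributes to the updated retrieval priority list.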

Operation of the Harvester

FIG. 6 is a flow diagram that illustrates the operations 
performed by the Harvester module of the Back-End com- 
ponent illustrated in FIG. 4. The Harvester receives 
resources retrieved from the World Wide Web by the Spider,
such as web pages, web documents and the
like. The Harvester module determines the type of document 
that has been retrieved according to a Content Type model 
selected from a Content Type Library, and then extracts 
information from specified fields according to the Content 
Type model. The extracted information is then passed on to 
the Classifier. 
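
A minimal sketch of this identify-then-extract behavior is given below. The library contents, the cue words used to recognize a document type, and the regular expressions for each field are hypothetical and chosen only for illustration; the description does not specify how a Content Type model is represented.

```python
import re

# Hypothetical Content Type Library: each model lists cue words used to
# recognize the document type and a pattern for every field to extract.
CONTENT_TYPE_LIBRARY = {
    "resume": {
        "cues": ("experience", "education"),
        "fields": {
            "name": re.compile(r"^Name:\s*(.+)$", re.MULTILINE | re.IGNORECASE),
            "education": re.compile(r"^Education:\s*(.+)$",
                                    re.MULTILINE | re.IGNORECASE),
        },
    },
    "journal_article": {
        "cues": ("abstract", "author"),
        "fields": {
            "title": re.compile(r"^Title:\s*(.+)$", re.MULTILINE | re.IGNORECASE),
            "author": re.compile(r"^Author:\s*(.+)$", re.MULTILINE | re.IGNORECASE),
        },
    },
}


def identify_content_type(text):
    """Pick the model whose cue words occur most often, or None if none do."""
    lowered = text.lower()
    scores = {name: sum(cue in lowered for cue in model["cues"])
              for name, model in CONTENT_TYPE_LIBRARY.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def harvest(text):
    """Return (content type, extracted fields) for a plain-text document."""
    content_type = identify_content_type(text)
    if content_type is None:
        return None, {}
    fields = {}
    for field, pattern in CONTENT_TYPE_LIBRARY[content_type]["fields"].items():
        match = pattern.search(text)
        if match:
            fields[field] = match.group(1).strip()
    return content_type, fields
```

The extracted field dictionary is what would be handed on to the Classifier in this sketch.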

The first operation of the Harvester module illustrated by 
flow diagram box number 602 is to format the document by 
converting the existing format of the document to one that
is recognized by the Harvester. For instance, the Harvester
may only recognize text in ASCII text format and the
document may be in HTML format; in this case the document
is converted to ASCII text format. In the next step, the
converted document is identified 604, by matching with
models from the Content Type Library 606. Once the
document has been matched with a Content Type model, the
document is formatted according to the model, as illustrated
in flow chart box 608. Resource fields in the document are
then extracted from the document 610. The extracted
resource links are then provided to the Classifier 612.

Operation of the Classifier

FIG. 8 is a flow diagram that illustrates the operations
performed by the Classifier module of the Back-End component
illustrated in FIG. 4. The Classifier receives
resources, such as web pages, web documents and the like,
extracted by a Harvester module and then determines the
most appropriate taxonomy location for the resource. The
resource includes the link address and a link description. The
resource may also contain additional links that the Harvester
retrieved. In the preferred embodiment, the Classifier uses
the Data store of the Back-End component to determine a
taxonomy location for the resource being processed. The
Classifier retrieves a model of an exemplary classification
from a Classifier Content Type Library to assist in identification
of appropriate categories for the resource. As
described below, Classifier programming compares the
stored data to corresponding taxonomy categorizations,
looking for matches between the stored data and the new
links, and makes corresponding categorizations. Other techniques
may also be used. For example, the Classifier may be
implemented with neural network learning techniques that
can "learn" from prior data.

The first operation of the Classifier, represented by the
flow diagram box numbered 802, is to receive a resource
page from the Harvester.

In the next step 804, the resource page and links that it
may contain are scored, and compared for internal consistency.
Ideally the page score and the links score should be
similar, indicating that they are directed to the same topic. In
the next operation, the Classifier compares the resource page
against every harvested resource (page) in a taxonomy
category and assigns each comparison a similarity score.
That is, each taxonomy category will be assigned a similarity
score that indicates the similarity between that category
heading and the resource (page) being processed. The comparison
may be implemented using, for example, a "Naive
Bayes" comparison technique, which will be known to those
skilled in the art. This comparison operation is represented
by the flow diagram box numbered 806, the Classifier
compares the descriptions of the linked pages with the
description of the page being processed, again using the
"Naive Bayes" technique, and assigns each comparison a
similarity score. A typical web page, for example, may
contain five or six links.

Using a predetermined concatenation formula, the Classifier
combines the score from the comparisons of step 804
and step 806 to produce a priority value. This operation is
indicated by the flow diagram box numbered 808. An
exemplary formula may be, for example, as follows. Priority
Value=3*(step 804 score)+1.5*(step 806 score). It is
expected that the formula for the priority value will be
determined experimentally, depending on the results
obtained and the characteristics of the documents being
harvested. The formula above may serve as a starting point.

In the next processing operation, the similarity score for
the page being processed is adjusted. The adjustment operation
is indicated by the flow diagram box numbered 810. In
particular, for every page linking to the page being processed
(that is, "incoming" links), a predetermined amount is added
to the taxonomy similarity score. In the preferred
embodiment, two points are added to the similarity score for
incoming links.

Thus, the similarity score for each taxonomy category
being checked against the web page for a fit is adjusted. The
score is adjusted upward for each incoming link to the web
page being processed, and the score is adjusted upward by
a lesser amount for each link that would itself be placed in
the same taxonomy category. After the scores have been
adjusted in this manner, the score for the taxonomy category
is stored in the Data store of the Back-End component. The
Classifier then checks for additional taxonomy categories to
process at the decision box numbered 814.

If there are additional taxonomy categories, an affirmative
outcome at the decision box 814, then processing moves to
the comparison operation 804. If all taxonomy categories
have been processed, a negative outcome at 814, then
processing moves to category selection at the flow diagram
box numbered 816. At category selection, the Classifier
selects the taxonomy category with the highest adjusted
similarity score and assigns the web page to that category
location. Alternatively, the Classifier may choose to assign
the web page to all taxonomy categories with a similarity
score greater than a predetermined threshold value. This
aspect of Classifier operation will depend on the design of
the database and the resources available. It should be apparent
that a greater number of categories will result in more
"hits" on a given search query, and will result in more cross
references between search terms. If no similarity score is
greater than a predetermined minimum score, then the web
page is assigned to an "Unknown" taxonomy category. Such
assignments can then be reviewed by a human operator for
reclassification, if desired. This completes the operation for
step 816, and other system operations may then continue.

The Front-End Component

FIG. 9 is a block diagram representation of the organization
of the Front-End component 108 illustrated in FIG. 1.
The Front-End component permits a user at a network node
to search the database created by the Back-End component.
Such searches will efficiently identify resources, such as web
documents, web pages and the like, that match the user
query. The user can then request such resources using
conventional methods, such as web browser (http) requests
or file transfer protocol (ftp) requests. Such a split between
the Back-End component for database creation and the
Front-End component for database access permits a greater
amount of user customization at the Front-End. This can
provide even greater efficiencies.

In the preferred embodiment, the Front-End component
108 includes a user interface 902 that permits convenient
communication between the Front-End and network user.
For example, the system may be designed so that Internet
users access the Front-End through an Internet web portal
site. The user interface 902 then comprises the portal site
web design. The Front-End also has a network access
component 904, which enables communication between the
Front-End and the user, and the Front-End and Back-End for
data collection and database management functions (FIG. 1).
Typically the Front-End accesses the Back-End using a
standard internet browser, such as Microsoft Internet
Explorer™, Netscape Navigator™, or the like. This is
particularly beneficial for a primary service provider at the
Back-End 102, because the primary service provider does
not have to provide its Front-End client with additional
software or protocols to initiate and maintain communication
with the Front-End, thereby eliminating the need to
provide software support and update for the Front-End client
by the primary service provider. An optional search engine
component 906 may be included with the Front-End, if
desired. The search engine 906 may be specially adapted to
search the database. Alternatively, a conventional search
engine such as those mentioned above may be used to search
the database. Finally, as described above, the database 106
may be optionally stored at the Front-End. Although illustrated
in FIG. 9 as being part of the Front-End, it should be
understood that the database 106 may be stored at any
network location that can be accessed by the system user 110
(FIG. 1) through the network access component 904.

A particularly advantageous configuration is where the
database 106 is stored at the Back-End 102. The configuration
relieves the Front-End 108 client from having to
store the database on its storage devices. Further, in the
instance of a Back-End primary service provider, where the
primary service provider is providing database services to a
plurality of users/clients, storage of the databases at its
location allows the primary service provider the benefit of
maintaining the databases from a centralized location. For
example, maintenance, updates and any revisions to the
software or the database structures can be efficiently accomplished
at one location by the primary service provider.

Another preferred embodiment is directed to the instance
where the Front-End is with a secondary service provider,
that is, a client to the primary service provider. The user
interface 902 then comprises an application that enables the
client to access the Back-End component 102 at the
primary service provider's location. The client is able to
initiate generation of new databases, initiate updates of
existing databases, develop taxonomies for organizing
retrieved resources, and manually place retrieved
resources into specific categories of the taxonomy. The
graphical user interface (GUI) used by the client is comprised
of a multi pane and multi control frame display. From
the GUI the client can inspect the taxonomy or hierarchy tree
in which the retrieved resources are organized. The GUI will
also have panes where the resources stored in a branch/
directory can be displayed, as well as any other
sub-branches/sub-directories that are organized under said
branch. In addition, the GUI will have a series of control
implements where such routine maintenance functions can
be initiated, including but not limited to, copying, moving,
deleting, creating new branches/directories, creating new
resources, refreshing the display, finalizing resources tagged
for deletion, logging out, and requesting help. Those skilled
in the art will be familiar with the multiple ways in which a
hierarchy may be represented for computer use, such as
linked lists and tables, and the typical functions used in
managing such hierarchies.

FIG. 10 is a flow diagram that illustrates the processing
performed by the Front-End component 108, where the
Front-End component is one that is accessed by a user of the
database. In the first processing operation, represented by
the flow diagram box numbered 1002, the Front-End carries
out a user authorization. This operation ensures proper data
access security and recordation of financial charges, if any.
Next, the Front-End receives a user database query at the
flow diagram box numbered 1004. The Front-End then
applies that query to the database, as indicated by the flow
diagram box numbered 1006. Lastly, the Front-End returns
the results to the user and may also permit user browsing of
the taxonomy hierarchy. The browsing operation is especially
useful to users who are not certain of how best to
characterize the information being sought, and permits users
to view the taxonomy hierarchy and travel among the
different taxonomy categories. This processing is represented
by the flow diagram box numbered 1008.

FIG. 12 is a block diagram illustrating the applications
and files of an embodiment of the present invention, which
enables the client to manage a database over the Internet. As
used in the present application, the term "management"
refers to the processes and functions associated with
organizing, revising and updating the objects that comprise
the database, such as resources (including documents, web
documents, and web pages), directories, and sub-directories,
and the database itself. The processes and functions include
but are not limited to copying, moving, deleting, creating a
new directory, creating a new resource, "Empty Trash",
logging out, accessing help files, renaming resources,
renaming directories, initiating a crawl for a new database,
or initiating an existing crawl taxonomy for updating an
existing database. Those of ordinary skill in the art would
understand and appreciate the aforementioned functions and
processes, and their application. In this embodiment, the
Front End component 1210, which resides with the client,
includes a browser application 1212, and a client identifier
file 1214. Typically this is a file that resides in the client's
computer, known as a "cookie", which contains information
indicating that the computer accessing the Back End component
is authorized to access and manage the client's
databases. Alternatively, the Back End component may
require the computer seeking access to transmit "user name"
and "password" or like information to verify its identity and
authorization. The Back End component 1220 includes a
server engine application 1222, a client identifier table 1224,
client interface application 1226, and a client database 1228.
Those of skill in the art would appreciate that the databases
can be organized as individual data structures, or a subset of
data structures within a larger data structure without changing
the operation of the present invention. The server engine
application receives requests and instructions from the
client to access the client's databases. The server and client
interact by exchanging information via communications link
1230, which may include transmission over the Internet. The
Back End component verifies that the user is authorized to
access the client's database, either through the client identifier
file 1214, or by verification of user name and password.

FIG. 13 illustrates the Client Interface application of one
embodiment of the invention, which displays the status and
procedures that may be initiated by the client. This example
display is sent from the server system 1222 to the client
system 1210, and it displays the status and taxonomy of the
client's database. The display illustrated in FIG. 13 contains
a Taxonomy section 1301, a Resource section 1302, and a
Control Bar section 1303. Those skilled in the art would
appreciate that these various sections can be omitted or
rearranged or adapted in various ways, while still maintaining
their overall functionality. The Taxonomy section 1301
provides a graphical and textual representation of the taxonomy
of the information contained in the database. The
resources in the database are typically organized according
to directories and sub-directories, which correspond to organizing
the resources according to genus and sub-genus
categories. Those of skill in the art would readily appreciate
this type of organization regime, and the nomenclature
associated with its use. Information gathered by the
present invention can be automatically assigned to a taxonomy
generated by the Classifier component of the present
invention, as disclosed herein. Alternatively, the client can
configure the invention so that certain types of resources, or
all resources are manually ordered into a taxonomy by the
client. The Taxonomy section provides a toggle box 1301a,
which designates that an action is to be performed on the
associated directory or sub-directory; and a toggle box
1301b, which toggles a specified directory to expand all of
its sub-directories, or to collapse only to the parent directory.
The Resource section 1302 provides detailed information
regarding a specific directory. Within the Resource section is
a sub-section 1302a for displaying detailed information
relating to the resources that are classified in this directory,
and a sub-section 1302b for displaying sub-directories that
are associated with this directory. The Control bar section
1303 provides buttons that dictate and initiate actions that
are to be performed on the directories, sub-directories or
resources that have been tagged in the Taxonomy or
Resource sections. In the present example, some of the
actions that can be performed are copy 1303a, move 1303b,
delete 1303c, new directory 1303d, new resource 1303e,
empty trash 1303f, log out 1303g, help 1303h, updating an
existing database 1303i, and generating a new database
1303j. Those of skill in the art would understand the
operation of these functions and appreciate that any of these
functions can be omitted or rearranged or adapted in various
ways. Those of skill in the art would also understand that the
functions that are available or desirable for managing files and
directories are not limited to those illustrated above.

FIG. 14 provides further illustration of the Resource
section 1302 of the Client Interface Page 1226. When a
directory is selected in the Taxonomy section 1301, the
resources and sub-directories associated with this directory
are displayed in the Resource section 1302. Resources are
links on the Web that have been identified as being
relevant to the search criteria for the database. Each resource
can have one or more properties that describe the data the
resource contains. The Resource section displays and manages
this information for the client. The Resource section
can have three sub-sections: Resources 1401, Viewing
information and Control 1402, and Sub-directories 1403. The
Resources sub-section 1401 displays information about the
properties of the resource in a tabular form with the
individual resources listed as rows and properties, such as the
resource's name 1401a, type 1401b, date last updated
1401c, date created 1401d, and a description 1401e, as
columns. The display provides for the sorting of the
resources in ascending or descending order according to the
various properties by clicking on the column header of the
desired property. Each resource has an associated toggle box
1401f, which can be toggled to indicate that a specific action
is to be performed on the resource. The Viewing and Control
sub-section 1402 displays information regarding the number
of resources being displayed in the Resources sub-section
1402a. For example, the View portion can display the
current number of resources being viewed out of the total
number available. The Viewing and Control sub-section
1402 also provides control boxes 1402b for setting the
number of resources displayed. The Sub-directories
sub-section 1403 displays any sub-directories 1403a that are
associated with the directory being viewed. Each
sub-directory has an associated toggle box 1403b, which can be
toggled to indicate that a specific action is to be performed
on the sub-directory.

As is evident to those skilled in the art, the present
invention provides an advantageous method of permitting a
secondary service provider to review and organize
the retrieved resources and to refine the search parameters
used by the Spider for updating the database, thereby
improving the efficiency of the Spider without the intervention
of the primary service provider.

Further, the present invention provides a method for a
primary service provider to provide database services at
improved efficiencies. For example, the method of updating
the retrieval priority list during the course of a crawl results
in the Spider at any given point always retrieving the most
relevant documents, rather than automatically retrieving all the
links regardless of relevancy, resulting in a higher ratio of
relevant resources retrieved to overall number of resources
retrieved. This provides the primary service provider with a
better product for its client. This is also accomplished using
minimal computer time/resources, which provides
increased economy and efficiency to the primary service
provider. In addition, the present invention permits the
client/secondary service provider to review and revise the
results of a crawl without the need for human intervention
from the primary service provider, thereby providing
additional instances of economy to the primary service
provider.

Thus, the system described above provides an efficient
technique for indexing web pages and creating a database
that will provide more relevant search results and more 
efficient operation. These efficiencies are obtained through 
specialized components, such as the Spider, Harvester and 
Classifier described above. 
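
As a brief recap of the Classifier scoring described with reference to FIG. 8, the following sketch combines the exemplary priority formula, the incoming-link adjustment, and category selection. The weights 3 and 1.5 and the two-point incoming-link bonus are taken from the description; the bonus of 1.0 for links falling in the same category (described only as "a lesser amount"), the function names, and the example minimum score are assumptions.

```python
def priority_value(step_804_score, step_806_score,
                   weight_804=3.0, weight_806=1.5):
    """Exemplary formula: 3*(step 804 score) + 1.5*(step 806 score)."""
    return weight_804 * step_804_score + weight_806 * step_806_score


def adjusted_similarity(base_score, incoming_links, same_category_links,
                        incoming_bonus=2.0, same_category_bonus=1.0):
    """Raise a category's similarity score according to link structure.

    incoming_bonus follows the preferred embodiment (two points per
    incoming link); same_category_bonus is an assumed lesser amount.
    """
    return (base_score
            + incoming_bonus * incoming_links
            + same_category_bonus * same_category_links)


def select_category(adjusted_scores, minimum_score):
    """Choose the best-scoring category, or "Unknown" if none is high enough."""
    if not adjusted_scores:
        return "Unknown"
    best = max(adjusted_scores, key=adjusted_scores.get)
    return best if adjusted_scores[best] > minimum_score else "Unknown"
```

For instance, under these assumed values a base score of 1.0 with three incoming links and two same-category links becomes 9.0, and a page whose best adjusted score does not exceed the minimum is left for human review in the "Unknown" category.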

The present invention has been described above in terms 
of a presently preferred embodiment so that an understand- 
ing of the present invention can be conveyed. 

There are, however, many configurations for HTML document
retrieval and indexing systems not specifically
described herein but with which the present invention is 
applicable. The present invention should therefore not be 
seen as limited to the particular embodiments described 
herein, but rather, it should be understood that the present 
invention has wide applicability with respect to HTML 
document retrieval and indexing systems generally. All 
modifications, variations, or equivalent arrangements and 
implementations that are within the scope of the attached 
claims should therefore be considered within the scope of 
the invention. 

We claim: 

1. An automated method of creating or updating a data- 
base of resumes and related documents from a network of 
documents, said method comprising, 

a) entering at least one example document that is relevant 
to a subject taxonomy in a retrieval priority list, if there 
is a plurality of example documents stored in said 
retrieval priority list, ranking said example documents 
according to the relevancy of said example documents 
to said subject taxonomy; 

b) retrieving a document from a network of documents, 
where said document is the most relevant document to 
said subject taxonomy stored in said retrieval priority 
list; 

c) harvesting information from specified fields of said 
document; 

d) classifying said information into one or more classes 
according to specified categories of said subject tax- 
onomy; 

e) storing said information into a database; 

f) determining whether said information are links to other 
documents; 

g) ranking said links according to relevancy to said 
subject taxonomy, and storing said links in said 
retrieval priority list according to said relevancy; 

h) terminating said method, provided said method's stop 
criteria have been met; and 

i) repeating steps b) through h), provided said method's 
stop criteria have not been met. 

2. The method of claim 1, wherein in step c) said specified 
fields are according to a Harvester Content Type Model. 

3. The method of claim 1, wherein in step d) said specified 
categories are according to a Classifier Content Type Model. 

4. The method of claim 1, wherein in step g) said link's 
relevancy is determined according to said Classifier Content 
Type Model. 

5. The method of claim 1, wherein in step g) said link's 
relevancy is determined according to a Directed Graph 
Cluster Module. 

6. The method of claim 1 further comprising: 

a) receiving a topic; 

b) applying the topic to the subject taxonomy of the 
database. 

7. A computer system for creating or updating a database 
of resumes and related documents, said computer system 
comprising: 

a central processing unit that can establish communication 
with a network of documents; and 

a program memory that stores programming instructions 
executed by said central processing unit, wherein said 
computer system executing said programming instructions 
performs a process comprising, 

a) entering at least one example document that is 
relevant to a subject taxonomy in a retrieval priority 
list, if there is a plurality of example documents 
stored in said retrieval priority list, ranking said 
example documents according to the relevancy of 
said example documents to said subject taxonomy; 

b) retrieving a document from a network of documents, 
where said document is the most relevant document 
to said subject taxonomy stored in said retrieval 
priority list; 

c) harvesting information from specified fields of said 
document; 

d) classifying said information into one or more classes 
according to specified categories of said subject 
taxonomy; 

e) storing said information into a database; 

f) determining whether said information are links to 
other documents; 

g) ranking said links according to relevancy to said 
subject taxonomy, and storing said links in said 
retrieval priority list according to said relevancy; 

h) terminating said method, provided said method's 
stop criteria have been met; and 

i) repeating steps b) through h), provided said method's 
stop criteria have not been met. 

8. A computer-readable medium having computer- 
executable instructions for performing a method compris- 
ing: 

a) entering at least one example document that is relevant 
to a subject taxonomy in a retrieval priority list, if there 
is a plurality of example documents stored in said 
retrieval priority list, ranking said example documents 
according to the relevancy of said example documents 
to said subject taxonomy; 

b) retrieving a document from a network of documents, 
where said document is the most relevant document to 
said subject taxonomy stored in said retrieval priority 
list; 

c) harvesting information from specified fields of said 
document; 

d) classifying said information into one or more classes 
according to specified categories of said subject tax- 
onomy; 

e) storing said information into a database; 

f) determining whether said information are links to other 
documents; 

g) ranking said links according to relevancy to said 
subject taxonomy, and storing said links in said 
retrieval priority list according to said relevancy; 

h) terminating said method, provided said method's stop 
criteria have been met; and 

i) repeating steps b) through h), provided said method's 
stop criteria have not been met. 

9. An automated method of creating or updating a data- 
base of resumes and related documents, said method com- 
prising: 

a) training a spider to retrieve relevant documents from 
example documents within a retrieval priority list and 
ranking said example documents according to the rel- 
evancy of said example documents to subject taxonomy 
from a network of documents; 

b) retrieving said relevant documents from said network 
of documents; 

c) extracting information from said retrieved relevant 
documents; 

d) classifying said extracted information; 

e) storing said extracted information into a database; 

f) determining whether said information are links to other 
documents; 

g) ranking said links according to relevancy to said 
taxonomy, and storing said links in said retrieval pri- 
ority list according to said relevancy; 

h) terminating said method, provided that said method's 
stop criteria have been met; and 

i) repeating steps b) through h), provided said method's 
stop criteria have not been met. 

10. A database of resumes and related documents created 
from a method comprising: 

a) training a spider to retrieve relevant documents from 
example documents within a retrieval priority list and 
ranking said example documents according to the rel- 
evancy of said example documents to subject taxonomy 
from a network of documents; 

b) retrieving said relevant documents from said network 
of documents; 

c) extracting information from said retrieved relevant 
documents; 

d) classifying said extracted information; 

e) storing said extracted information into a database; 

f) determining whether said information are links to other 
documents; 

g) ranking said links according to relevancy to said 
taxonomy, and storing said links in said retrieval pri- 
ority list according to said relevancy; 

h) terminating said method, provided that said method's 
stop criteria have been met; and 

i) repeating steps b) through h), provided said method's 
stop criteria have not been met. 

* * * * * 


